Abstract

The Reverse Greedy algorithm (RGreedy) for the k-median problem works as follows. It starts by
placing facilities on all nodes. At each step, it removes a facility to minimize the total distance
to the remaining facilities. It stops when k facilities remain. We prove that, if the distance
function is metric, then the approximation ratio of RGreedy is between $\Omega(\log n/\log\log n)$
and $O(\log n)$.
1 Introduction

An instance of the metric k-median problem consists of a metric space $\mathcal{X} = (X, c)$, where
X is a set of points and c is a distance function (also called the cost) that specifies the
distance $c_{xy} \ge 0$ between any pair of nodes $x, y \in X$. The distance function is reflexive,
symmetric, and satisfies the triangle inequality. Given a set of points $F \subseteq X$, the cost
of F is defined by $\mathrm{cost}(F) = \sum_{x \in X} c_{xF}$, where $c_{xF} = \min_{f \in F} c_{xf}$
for $x \in X$. Our objective is to find a k-element set $F \subseteq X$ that minimizes
$\mathrm{cost}(F)$.
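
As a concrete illustration of this definition, the following minimal Python sketch evaluates
cost(F) on a distance matrix. The function name `cost`, the matrix representation, and the
four-point example are assumptions made for this illustration only.

```python
from typing import Sequence

def cost(dist: Sequence[Sequence[float]], facilities: Sequence[int]) -> float:
    """Sum over all points x of the distance from x to its nearest facility in F."""
    if not facilities:
        raise ValueError("the facility set F must be non-empty")
    return sum(min(row[f] for f in facilities) for row in dist)

# Hypothetical toy metric: four points on a line at positions 0, 1, 2, 10.
positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]

print(cost(dist, [1]))     # service distances 1 + 0 + 1 + 9
print(cost(dist, [1, 3]))  # service distances 1 + 0 + 1 + 0
```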

Intuitively, we think of F as a set of facilities and of $c_{xF}$ as the cost of serving a customer
at x using the facilities in F. Then $\mathrm{cost}(F)$ is the overall service cost associated with
F. The k-element set that achieves the minimum value of $\mathrm{cost}(F)$ is called the k-median
of $\mathcal{X}$.

The k-median problem is a classical facility location problem and has a vast literature. Here,
we review only the work most directly related to this paper. The problem is well known to be
NP-hard, and extensive research has been done on approximation algorithms for the metric version.
Arya et al. [1] show that the optimal solution can be approximated in polynomial time within
ratio $3 + \epsilon$, for any $\epsilon > 0$, and this is the smallest approximation ratio known.
Earlier, several approximation algorithms with constant, but somewhat larger, approximation ratios
appeared in the works by Charikar et al., Charikar and Guha, and Jain and Vazirani [8]. Jain et
al. [7] show a lower bound of $1 + 2/e$ on the approximation ratio for this problem (assuming
$P \ne NP$).

In the oblivious version of the k-median problem, first studied by Mettu and Plaxton, the
algorithm is not given k in advance. Instead, requests for additional facilities arrive over time.
When a request arrives, a new facility must be added to the existing set. In other words, the
algorithm computes a nested sequence of facility sets $F_1 \subset F_2 \subset \cdots \subset F_n$,
where $|F_k| = k$ for all k. This problem is also known in the literature as online median or
incremental median, and the analogous version for clustering is called oblivious clustering in
[2, 3]. The algorithm presented by Mettu and Plaxton guarantees that $\mathrm{cost}(F_k)$
approximates the optimal k-median cost within a constant factor (independent of k). They also show
that in this oblivious setting no algorithm can achieve approximation ratio better than
$2 - 2/(n-1)$.

The naive approach to the median problem is to use the greedy algorithm: Start with
$F_0 = \emptyset$, and at each step $k = 1, \ldots, n$, let $F_k = F_{k-1} \cup \{f_k\}$, where
$f_k \in X - F_{k-1}$ is chosen so that $\mathrm{cost}(F_k)$ is minimized. Clearly, this is an
oblivious algorithm. It is not difficult to show, however, that its approximation ratio is
$\Omega(n)$.
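
The forward greedy rule can be sketched in a few lines of Python. This is only an illustration
under assumptions: the names `cost` and `forward_greedy`, the distance-matrix input, and the
tie-breaking rule (smallest index first) are not specified in the text.

```python
def cost(dist, facilities):
    """Total distance from every point to its nearest facility."""
    return sum(min(row[f] for f in facilities) for row in dist)

def forward_greedy(dist):
    """Return [f_1, ..., f_n]; the prefix of length k is the facility set F_k."""
    chosen = []
    remaining = set(range(len(dist)))
    while remaining:
        # Add the point whose inclusion gives the smallest cost (ties: smallest index).
        best = min(sorted(remaining), key=lambda f: cost(dist, chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical toy metric: four points on a line at positions 0, 1, 2, 10.
positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
order = forward_greedy(dist)
print(order[:2])  # the facility set F_2 chosen by the greedy rule
```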

Reverse Greedy. Amos Fiat [6] proposed the following alternative idea. Instead of starting
with the empty set and adding facilities, start with all nodes being facilities and remove them
one by one in a greedy fashion. More formally, Algorithm RGreedy works as follows: Initially,
let $R_n = X$. At step $k = n, n-1, \ldots, 2$, let $R_{k-1} = R_k - \{r_k\}$, where
$r_k \in R_k$ is chosen so that $\mathrm{cost}(R_{k-1})$ is minimized. For the purpose of oblivious
computation, the sequence of facilities could be precomputed and then produced in order
$(r_1, r_2, \ldots, r_n)$.
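
A direct, unoptimized rendering of RGreedy in Python might look as follows. It is a sketch
under assumptions: the function names, the distance-matrix input, and the tie-breaking rule
(first minimizer in index order) are choices made for this illustration, and costs are recomputed
from scratch at every step rather than maintained incrementally.

```python
def cost(dist, facilities):
    """Total distance from every point to its nearest facility."""
    return sum(min(row[f] for f in facilities) for row in dist)

def rgreedy(dist):
    """Return [r_1, ..., r_n]; the prefix of length k is the facility set R_k."""
    current = list(range(len(dist)))  # R_n = X
    removed = []                      # r_n, r_{n-1}, ..., r_2 in order of removal
    while len(current) > 1:
        # Remove the facility whose removal leaves the cheapest remaining set.
        victim = min(current, key=lambda r: cost(dist, [f for f in current if f != r]))
        current.remove(victim)
        removed.append(victim)
    return current + removed[::-1]

# Hypothetical toy metric: four points on a line at positions 0, 1, 2, 10.
positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
order = rgreedy(dist)
print(order)  # prefixes of this list are R_1, R_2, ...
```

Producing the list in the order $r_1, r_2, \ldots$ is what makes the precomputed sequence usable
in the oblivious setting: the first k entries form $R_k$ for every k.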

Fiat asked whether RGreedy is an $O(1)$-approximation algorithm for the metric k-median
problem. In this note we present a nearly tight analysis of RGreedy by showing that its
approximation ratio is between $\Omega(\log n/\log\log n)$ and $O(\log n)$. Thus, although its
ratio is not constant, RGreedy performs much better than the forward greedy algorithm.

2 The Upper Bound 

One crucial step of the upper bound is captured by the following lemma. 

Lemma 2.1 Consider two subsets R and M of X. Denote by Q the set of facilities in R that
serve M, that is, a minimal subset of R such that $c_{\mu Q} = c_{\mu R}$ for all $\mu \in M$. Then
for every $x \in X$ we have $c_{xQ} \le 2c_{xM} + c_{xR}$.

Proof: For any $x \in X$, choose $r \in R$ and $\mu \in M$ that serve x in R and M, respectively.
In other words, $c_{xR} = c_{xr}$ and $c_{xM} = c_{x\mu}$. We have $c_{\mu r} \ge c_{\mu Q}$, by the
definition of Q. Thus
$c_{xQ} \le c_{x\mu} + c_{\mu Q} \le c_{x\mu} + c_{\mu r} \le 2c_{x\mu} + c_{xr} = 2c_{xM} + c_{xR}$. □
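
The inequality of Lemma 2.1, $c_{xQ} \le 2c_{xM} + c_{xR}$, can be spot-checked numerically. The
sketch below is an illustration, not part of the proof: the helper names, the random points in the
unit square, the seed, and the small floating-point tolerance are all assumptions. The set Q is
built by picking one nearest facility of R for each point of M, which satisfies the only property
of Q that the proof uses.

```python
import math
import random

def nearest(dist, x, S):
    """c_{xS}: distance from x to the closest point of S."""
    return min(dist[x][s] for s in S)

def serving_set(dist, R, M):
    """For each mu in M pick one facility of R closest to mu, so c_{mu Q} = c_{mu R}."""
    return {min(R, key=lambda r: dist[mu][r]) for mu in M}

def lemma_holds(dist, R, M, tol=1e-12):
    """Check c_{xQ} <= 2 c_{xM} + c_{xR} for every point x."""
    Q = serving_set(dist, R, M)
    return all(
        nearest(dist, x, Q) <= 2 * nearest(dist, x, M) + nearest(dist, x, R) + tol
        for x in range(len(dist))
    )

random.seed(0)  # arbitrary seed, only for reproducibility
points = [(random.random(), random.random()) for _ in range(12)]
dist = [[math.dist(p, q) for q in points] for p in points]
checks = [
    lemma_holds(dist, random.sample(range(12), 5), random.sample(range(12), 3))
    for _ in range(200)
]
print(all(checks))
```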

Now, fix k and let M be the optimal k-median of $\mathcal{X}$. Consider a step j of RGreedy (when
we remove $r_j$ from $R_j$ to obtain $R_{j-1}$), for $j > k$. Denote by Q the set of facilities in
$R_j$ that serve M. We estimate first the incremental cost in step j:

$$
\begin{aligned}
\mathrm{cost}(R_{j-1}) - \mathrm{cost}(R_j)
  &\le \min_{r \in R_j \setminus Q} \mathrm{cost}(R_j \setminus \{r\}) - \mathrm{cost}(R_j) && (1) \\
  &\le \frac{1}{|R_j \setminus Q|} \sum_{r \in R_j \setminus Q} [\mathrm{cost}(R_j \setminus \{r\}) - \mathrm{cost}(R_j)] && (2) \\
  &\le \frac{1}{j-k} \sum_{r \in R_j \setminus Q} [\mathrm{cost}(R_j \setminus \{r\}) - \mathrm{cost}(R_j)] && (3) \\
  &\le \frac{1}{j-k} [\mathrm{cost}(Q) - \mathrm{cost}(R_j)] && (4) \\
  &\le \frac{2}{j-k}\, \mathrm{cost}(M). && (5)
\end{aligned}
$$

The first inequality follows from the definition of $R_{j-1}$, in the second one we estimate the
minimum by the average, and the third one follows from $|Q| \le k$. We now justify the two
remaining inequalities.

Inequality (4) is related to the supermodularity property of the cost function. We need
to prove that

$$ \sum_{r \in R \setminus Q} [\mathrm{cost}(R \setminus \{r\}) - \mathrm{cost}(R)] \le \mathrm{cost}(Q) - \mathrm{cost}(R), $$

where $R = R_j$. To this end, we examine the contribution of each $x \in X$ to both sides. The
contribution of x to the right-hand side is exactly $c_{xQ} - c_{xR}$. On the left-hand side, the
contribution of x is positive only if $c_{xQ} > c_{xR}$ and, if this is so, then x contributes only
to one term, namely the one for the $r \in R \setminus Q$ that serves x in R (that is,
$c_{xr} = c_{xR}$). Further, this contribution cannot be greater than $c_{xQ} - c_{xR}$ because
$Q \subseteq R \setminus \{r\}$. (Note that we do not use here any special properties of Q and R.
This inequality holds for any $Q \subseteq R \subseteq X$.)

Finally, to get (5), we apply Lemma 2.1 to the sets $R = R_j$, M, and Q, and sum over all
$x \in X$.

We have thus proved that $\mathrm{cost}(R_{j-1}) - \mathrm{cost}(R_j) \le \frac{2}{j-k}\mathrm{cost}(M)$.
Summing up over $j = n, n-1, \ldots, k+1$, we obtain our upper bound.

Theorem 2.2 The approximation ratio of Algorithm RGreedy in metric spaces is at most
$2H_{n-k} = O(\log n)$.
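
For completeness, the summation behind Theorem 2.2 can be written out as below. This is a sketch
of the step rather than text from the paper; it uses $\mathrm{cost}(R_n) = \mathrm{cost}(X) = 0$,
the per-step bound $\frac{2}{j-k}\mathrm{cost}(M)$, and the notation $H_m$ for the m-th harmonic
number, for which $H_m \le 1 + \ln m$.

```latex
\begin{align*}
\mathrm{cost}(R_k)
  &= \mathrm{cost}(R_n) + \sum_{j=k+1}^{n} \bigl[\mathrm{cost}(R_{j-1}) - \mathrm{cost}(R_j)\bigr] \\
  &\le 0 + \sum_{j=k+1}^{n} \frac{2}{j-k}\,\mathrm{cost}(M)
   = 2\,\mathrm{cost}(M) \sum_{i=1}^{n-k} \frac{1}{i}
   = 2 H_{n-k}\,\mathrm{cost}(M).
\end{align*}
```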



3 The Lower Bound 

In this section we construct an n-point metric space $\mathcal{X}$ where, for k = 1, the ratio
between the cost of RGreedy's facility set and the optimal cost is $\Omega(\log n/\log\log n)$.
(For general k, a lower bound of $\Omega(\log(n/k)/\log\log(n/k))$ follows easily, by simply taking
k copies of $\mathcal{X}$.)

To simplify presentation, we allow distances between different points in X to be 0. These
distances can be changed to some appropriately small $\epsilon > 0$ without affecting the
asymptotic ratio. Similarly, whenever convenient, we will break the ties in RGreedy in our favor.

Let $\hat{T}$ be a graph that consists of a tree T with root $\rho$ and a node $\mu$ connected to
all leaves of T. T itself consists of h levels numbered $1, 2, \ldots, h$, with the leaves at level
1 and the root $\rho$ at level h. Each node at level $j > 1$ has $(j+1)^3$ children in level
$j - 1$.

To construct $\mathcal{X}$, for each node x of T at level j we create a cluster of $w_j = j!^3$
points (including x itself) at distance 0 from each other. Node $\mu$ is a 1-point cluster. All
other distances are defined by shortest-path lengths in $\hat{T}$.

First, we show that, for k = 1, RGreedy will end up with the facility at $\rho$. Indeed, RGreedy
will first remove all but one facility from each cluster. Without loss of generality, let those
remaining facilities be located at the nodes of T, and from now on we will think of $w_j$ as the
weight of each node in layer j. At the next step, we break ties so that RGreedy will remove the
facility from $\mu$.

We claim that in any subsequent step t, if j is the first layer that has a facility, then RGreedy
has a facility on each node of T in layers $j+1, \ldots, h$. To prove it, we show that this
invariant is preserved in one step. If a node x in layer j has a facility then, by the invariant,
this facility serves all the nodes in the subtree $T_x$ of T rooted at x, plus possibly $\mu$ (if x
has the last facility in layer j). What facility will be removed by RGreedy at this step? The cost
of removing any facility from layers $j+1, \ldots, h$ is at least $w_{j+1}$. If we remove the
facility from x, all the nodes served by x can switch to the parent of x, so the increase in cost
is bounded by the total weight of $T_x$ (possibly plus one, if x serves $\mu$). $T_x$ has
$(j+1)!^3/(i+1)!^3$ nodes in each layer $i \le j$. So the total weight of $T_x$ is

$$
w(T_x) = \sum_{i=1}^{j} w_i \cdot (j+1)!^3/(i+1)!^3
       = (j+1)!^3 \sum_{i=1}^{j} (i+1)^{-3}
       < (j+1)!^3
       = w_{j+1},
$$

where the inequality above follows from $\sum_{i=1}^{j} (i+1)^{-3} \le \sum_{i=2}^{\infty} i^{-3} < 1$.
Thus removing x increases the cost by at most $w(T_x) + 1 \le w_{j+1}$, so RGreedy will remove x or
some other node from layer j in this step, as claimed. Therefore, overall, after $n - 1$ steps,
RGreedy will be left with the facility at $\rho$.

By the previous paragraph, the cardinality (total weight) of X is $n = w(T) + 1 \le (h+1)!^3$,
so $h = \Omega(\log n/\log\log n)$. The optimal cost is

$$
\mathrm{cost}(\mu) = \sum_{i=1}^{h} i \cdot w_i \cdot (h+1)!^3/(i+1)!^3
                   = (h+1)!^3 \sum_{i=1}^{h} i\,(i+1)^{-3}
                   < (h+1)!^3 \sum_{i=2}^{\infty} i^{-2}
                   < (h+1)!^3,
$$

while the cost of RGreedy is

$$
\mathrm{cost}(\rho) = \sum_{i=1}^{h} (h-i) \cdot w_i \cdot (h+1)!^3/(i+1)!^3
                    = (h+1)!^3 \sum_{i=1}^{h} (h-i)(i+1)^{-3}
                    \ge (h-1)(h+1)!^3/8,
$$

where in the last step we estimate the sum by the first term. Thus the ratio is
$\mathrm{cost}(\rho)/\mathrm{cost}(\mu) > (h-1)/8 = \Omega(\log n/\log\log n)$.
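
The two sums above can be evaluated exactly for small h. The following Python sketch does so
with integer arithmetic; the function names are hypothetical, and, like the displayed formulas, it
counts only the points in the tree clusters (the single point $\mu$ is omitted).

```python
from math import factorial

def level_counts(h):
    """Number of tree nodes at level i = 1..h, namely ((h+1)!/(i+1)!)^3."""
    return {i: (factorial(h + 1) // factorial(i + 1)) ** 3 for i in range(1, h + 1)}

def costs(h):
    """Return (cost(mu), cost(rho)) as given by the two sums in the text."""
    count = level_counts(h)
    weight = {i: factorial(i) ** 3 for i in range(1, h + 1)}  # w_i = i!^3
    # A level-i node is at distance i from mu and at distance h - i from the root rho.
    cost_mu = sum(i * weight[i] * count[i] for i in range(1, h + 1))
    cost_rho = sum((h - i) * weight[i] * count[i] for i in range(1, h + 1))
    return cost_mu, cost_rho

for h in range(2, 7):
    cost_mu, cost_rho = costs(h)
    bound = factorial(h + 1) ** 3
    # Both estimates used in the text: cost(mu) < (h+1)!^3 and cost(rho) >= (h-1)(h+1)!^3/8.
    print(h, cost_mu < bound, 8 * cost_rho >= (h - 1) * bound)
```

Because the quantities grow like $(h+1)!^3$, only very small values of h are practical here; the
sketch checks the two inequalities, not the asymptotic statement.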

In the argument above we considered only the case k = 1. More generally, one might
characterize the performance ratio of the algorithm as a function of both n and k. Any lower bound
for k = 1 implies a lower bound for larger k by simply taking k (widely separated) copies of the
metric space. Therefore we obtain:

Theorem 3.1 The approximation ratio of Algorithm RGreedy in metric spaces is not better
than $\Omega(\log(n/k)/\log\log(n/k))$.



4 Technical Observations 

We have shown an $O(\log n)$ upper bound and an $\Omega(\log n/\log\log n)$ lower bound on the
approximation ratio of RGreedy for k-medians in metric spaces. Next we make some observations about
what it might take to improve our bounds. We focus on the case k = 1.

Comments on the upper bound. In the upper bound proof in Section 2 we show that the
incremental cost of RGreedy when removing $r_j$ from $R_j$ to obtain $R_{j-1}$ is at most
$2\,\mathrm{cost}(\mu)/(j-1)$, where $\mu$ denotes the optimal 1-median. The proof (inequalities
(1) through (5)) doesn't use any information about the structure of $R_j$: it shows that for any
set R of size j,

$$ \min_{r \in R} \mathrm{cost}(R \setminus \{r\}) - \mathrm{cost}(R) \le \frac{2\,\mathrm{cost}(\mu)}{j-1}. \qquad (6) $$
Next we describe a set R of size j in a metric space for which this latter bound is tight. The 
metric space is defined by the following weighted graph: 




[Figure: the weighted graph on the points $\mu$, $x_1, \ldots, x_j$ (each of weight w), and
$y_1, \ldots, y_j$, with the set R marked.]

The space has points $\mu, x_1, \ldots, x_j, y_1, \ldots, y_j$, where the points $x_i$ have weights
w, for some large integer w. (In other words, each $x_i$ represents a cluster of w points at
distance 0 from each other.) All other points have weight 1. Point $\mu$ is connected to each $x_i$
by an edge of length 1. Each $x_i$ is connected to $y_i$ by an edge of length 1, and to each
$y_\ell$, for $\ell \ne i$, by an edge of length 2. The distances are measured along the edges of
this graph.

For k = 1, the optimal cost is $\mathrm{cost}(\mu) = j(w+2)$. Now consider
$R = \{y_1, \ldots, y_j\}$. Removing any $y_i \in R$ increases the cost by about
$w \approx \mathrm{cost}(\mu)/j$. Thus, for this example, inequality (6) is tight, up to a constant
factor of about 2.
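
The numbers in this example can be reproduced with a short Python sketch. The indexing of the
points, the function names, and the sample values j = 4 and w = 100 are assumptions chosen for
illustration; shortest-path distances are computed with the Floyd-Warshall recurrence.

```python
from itertools import product

def build_example(j, w):
    """Distance matrix and weights; index 0 is mu, 1..j are x_1..x_j, j+1..2j are y_1..y_j."""
    n = 2 * j + 1
    inf = float("inf")
    d = [[0 if a == b else inf for b in range(n)] for a in range(n)]
    for i in range(1, j + 1):
        d[0][i] = d[i][0] = 1                    # edge mu -- x_i of length 1
        for l in range(1, j + 1):
            length = 1 if l == i else 2          # x_i -- y_i has length 1, x_i -- y_l length 2
            d[i][j + l] = d[j + l][i] = length
    # Floyd-Warshall: the first coordinate of product() is the outermost loop variable.
    for m, a, b in product(range(n), repeat=3):
        d[a][b] = min(d[a][b], d[a][m] + d[m][b])
    weights = [1] + [w] * j + [1] * j
    return d, weights

def weighted_cost(d, weights, facilities):
    """Weighted total distance to the nearest facility."""
    return sum(wt * min(row[f] for f in facilities) for wt, row in zip(weights, d))

j, w = 4, 100
d, weights = build_example(j, w)
R = list(range(j + 1, 2 * j + 1))                # R = {y_1, ..., y_j}
opt = weighted_cost(d, weights, [0])             # cost(mu)
increase = weighted_cost(d, weights, R[1:]) - weighted_cost(d, weights, R)
print(opt, increase, 2 * opt / (j - 1))          # cost(mu), increase, right-hand side of (6)
```

In this sketch the increase comes out as w + 3 rather than exactly w, because the removed point
$y_i$ itself must now be served at distance 3; for large w the difference is negligible.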

Of course, RGreedy would not produce the particular set R assumed above for $R_j$. Also,
this example only shows a single iteration where the incremental cost matches the upper bound
(6). Nonetheless, the example demonstrates that to improve the upper bound it is necessary to
consider some information about the structure of $R_j$ (due to the previous steps of RGreedy).

Comments on the lower bound. We can show that lower-bound constructions similar
to that in Section 3 are unlikely to give any improvement, in a technical sense formalized in
Lemma 4.1.

Fix a metric space $\mathcal{X} = (X, c)$ with n points, where n is a large integer. Let $\mu$ be
the 1-median of $\mathcal{X}$, and assume (by scaling) that its cost is $\mathrm{cost}(\mu) = n/2$.
Let B be the unit ball around $\mu$, that is, the set of points at distance at most 1 from $\mu$.
Note that $|B| > n/2$.

For $i \ge 0$, define $Z_i$ to be the points $x \in X$ such that $i - 1 < c_{x\mu} \le i$, and such
that there is a time when x is used by RGreedy as a facility for some point in B. Thus
$Z_0 = \{\mu\}$ and $Z_0 \cup Z_1 = B$. Also, for $i \le j$, let $Z_{i,j} = \bigcup_{\ell=i}^{j} Z_\ell$.

Let h be the maximum index for which $Z_h \ne \emptyset$. Define $t_j$ to be the time step when
RGreedy is about to remove the last facility from $Z_{0,j}$, and for $j \ge 7$ let $m_j$ be the
number of points served by $Z_j$ at time $t_{j-6}$. (The value of 6 is not critical; any constant
$C \ge 6$ will work, with some minor modifications.)

Lemma 4.1 Suppose that $\sum_{i=10}^{h} i\, m_i = O(n)$. Then, for k = 1, the approximation ratio
of RGreedy is $O(\log n/\log\log n)$.

Proof sketch: We will show that $h = O(\log n/\log\log n)$. Since the facility computed by
RGreedy for k = 1 is at distance at most h from $\mu$, this will imply the lemma, by the triangle
inequality.

We first argue that $Z_i = \emptyset$ cannot happen for more than four consecutive values of
$i < h$. Indeed, $Z_0, Z_1 \ne \emptyset$. Assume, towards a contradiction, that
$Z_i \ne \emptyset$ and that $Z_{i+1,i+4} = \emptyset$. Then at step $t_i$, when RGreedy deletes
the last facility $f \in Z_{0,i}$, its cost to serve $\mu$ increases by at least 4 and its cost to
serve B increases by more than $2|B| > n$. Let $j > i + 4$ be such that $Z_j \ne \emptyset$. By
Lemma 2.1, deleting a facility $f' \in Z_j$ at time $t_i$ would increase the cost by at most
$2\,\mathrm{cost}(\mu) \le n$, hence less than the cost of deleting f at time $t_i$, contradicting
the definition of RGreedy.

Now, consider any $i \le h - 9$. It is easy to see that over all steps
$t_i, t_i + 1, \ldots, t_{i+3}$, RGreedy's cost to serve B increases by at least $|B| > n/2$,
while, by the triangle inequality, all facilities that serve B at steps
$t_{i+1}, t_{i+1} + 1, \ldots, t_{i+3}$ are in $Z_{i+1,i+5}$. Thus, there exists a
$t \in [t_i, t_{i+3}]$ such that at step t, RGreedy deletes a facility f and pays an incremental
cost of at least $(n/2)/(1 + |Z_{i+1,i+5}|)$.

Suppose $Z_{i+9} \ne \emptyset$. Since $t \le t_{i+3}$, the facilities in $Z_{i+9}$ serve at most
$m_{i+9}$ clients. Therefore, at step t, deleting all facilities in $Z_{i+9}$ and serving their
clients using a remaining facility from $Z_{0,i+3}$ would have increased the cost by
$O(i\, m_{i+9})$, by the triangle inequality. So there exists a facility $f'$ in $Z_{i+9}$ whose
deletion at step t would have increased the cost by $O(i\, m_{i+9}/|Z_{i+9}|)$. Since at time t
RGreedy prefers to delete f rather than $f'$, we have

$$ (n/2)/(1 + |Z_{i+1,i+5}|) = O(i\, m_{i+9}/|Z_{i+9}|). $$

Rewriting and summing the above over i (including now those i for which $Z_{i+9}$ is empty),

$$ \sum_i \frac{|Z_{i+9}|}{1 + |Z_{i+1,i+5}|} = O\Bigl(\frac{1}{n} \sum_i i\, m_{i+9}\Bigr) = O\Bigl(\frac{1}{n} \sum_i i\, m_i\Bigr) \le A, \qquad (7) $$

for some constant A.

The intuition is that for this sum to be bounded by a constant, the cardinalities $|Z_i|$ must
rapidly decrease (except for some small number of abnormalities) and h cannot be too large. To
get a good estimate, let $y_i = |Z_{8i+1,8i+8}|$, for $i = 1, \ldots, \lfloor h/8 \rfloor - 1$. Then,

$$
\sum_{i=1}^{\lfloor h/8 \rfloor - 2} \frac{y_{i+1}}{y_i + y_{i+1}}
  = \sum_{i=1}^{\lfloor h/8 \rfloor - 2} \sum_{j=8i+1}^{8i+8} \frac{|Z_{j+8}|}{|Z_{8i+1,8i+16}|}
  \le \sum_{i=1}^{\lfloor h/8 \rfloor - 2} \sum_{j=8i+1}^{8i+8} \frac{|Z_{j+8}|}{1 + |Z_{j,j+4}|}
  \le A,
$$

where the next-to-last inequality holds because $1 + |Z_{j,j+4}| \le |Z_{8i+1,8i+16}|$ for all
$j = 8i+1, \ldots, 8i+12$. (Here, again, we use the fact that at most four consecutive $Z_i$'s can
be empty.)

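Now let $\beta_i = y_{i+1}/y_i$ for all $i = 1, \ldots, \lfloor h/8 \rfloor - 2$. We have
$\sum_i \beta_i/(1 + \beta_i) \le A$. Therefore $\beta_i \le 1$ for all except at most 2A i's. So
there are m and $g \ge (\lfloor h/8 \rfloor - 2)/(2A)$ such that $\beta_i \le 1$ for all
$i = m, \ldots, m + g - 1$. For those i's we get

$$
\sum_{i=m}^{m+g-1} \beta_i \le \sum_{i=m}^{m+g-1} \frac{2\beta_i}{1 + \beta_i}
  = 2 \sum_{i=m}^{m+g-1} \frac{\beta_i}{1 + \beta_i} \le 2A.
$$

Let $\sum_{i=m}^{m+g-1} \beta_i = B \le 2A$. Then $\prod_{i=m}^{m+g-1} \beta_i$ is maximized when
all $\beta_i$ are equal to $B/g$. Since the product telescopes to $y_{m+g}/y_m \ge 1/n$, we have
therefore

$$ \frac{1}{n} \le \prod_{i=m}^{m+g-1} \beta_i \le (B/g)^g. $$

Thus $(g/B)^g \le n$, and we obtain $h = O(g) = O(\log n/\log\log n)$, completing the proof. □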
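
The last implication, from $(g/B)^g \le n$ to $g = O(\log n/\log\log n)$, can be unpacked as
follows. This is a sketch of one standard way to see it, not the authors' wording; it assumes
only that $B \le 2A$ is bounded by a constant and that n is sufficiently large.

```latex
\begin{align*}
&(g/B)^g \le n \;\Longrightarrow\; g \ln(g/B) \le \ln n. \\
&\text{Suppose } g > \frac{2\ln n}{\ln\ln n}. \text{ For large } n,\quad
  \frac{g}{B} > \frac{2\ln n}{B \ln\ln n} \ge \sqrt{\ln n},
  \quad\text{so}\quad \ln(g/B) \ge \tfrac{1}{2}\ln\ln n, \\
&\text{and hence}\quad
  g \ln(g/B) > \frac{2\ln n}{\ln\ln n} \cdot \tfrac{1}{2}\ln\ln n = \ln n,
  \quad\text{a contradiction.} \\
&\text{Therefore } g \le \frac{2\ln n}{\ln\ln n}, \text{ and }
  h \le 8\,(2A\,g + 3) = O\!\left(\frac{\log n}{\log\log n}\right).
\end{align*}
```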

Note that the assumption of the lemma holds for the metric space used in Section 3. There, each
set $Z_i$, for $i = 1, \ldots, h$, consists of the nodes in T at level i, and
$m_i = (h+1)!^3/(i+1)^3$ is the total weight of level i so, indeed,
$\sum_i i\, m_i = O((h+1)!^3) = O(n)$. The lemma suggests that in order to improve the lower bound,
one would need to design an example where at every time $t_i$, the facilities serving nodes at
distance at most i from $\mu$ are distributed more or less uniformly across the remaining
facilities.